9 research outputs found

Korpus beranotasi: ke arah pengembangan korpus bahasa-bahasa di Indonesia (Annotated corpora: toward the development of corpora for the languages of Indonesia)

Author: Dinakaramani Arawinda
Suhardijanto Totok
Publication venue: 'Badan Pengembangan dan Pembinaan Bahasa'
Publication date: 01/01/2018

Although known as the country with the second-greatest linguistic and cultural diversity in the world after Papua New Guinea, Indonesia is, ironically, also known as a country with few electronic language resources. Ethnologue (Simons and Fennig 2018) reports that there are 719 regional languages in Indonesia, so building language resources (sumber daya bahasa, SDB) for all of them will inevitably take considerable time and money.

Repositori Institusi Kemendikbud

Indonesian lexical bundles in research articles: Frequency, structure, and function

Author: Budiwiyanto Adi
Suhardijanto Totok
Publication venue: 'Universitas Pendidikan Indonesia (UPI)'
Publication date: 18/10/2020

Recent studies show that lexical bundles are pervasive in English academic discourse, and that their characteristics vary across registers and genres. Lexical bundles in languages other than English, however, remain worth investigating. This study aims to describe the frequency, structure, and function of Indonesian lexical bundles in research articles. The study adopted a mixed-methods approach. Lexical bundles were identified using WordSmith 7.0 in a corpus of 3,125,546 words drawn from 1126 texts across six disciplines. With a frequency threshold of 40 per million words and a minimum distribution of 5 texts, 197 lexical bundles were obtained, consisting of three- to six-word bundles with a total of 51,813 occurrences. In terms of structure, incomplete structures dominate, accounting for 78.7% of the bundles with a total of 38,749 occurrences. The bundle patterns can be classified into five types: noun-based, prepositional-based, verb-based, adjective-based, and clause-based. Lexical bundles in research articles are most often clause-based (49.2%). This indicates that Indonesian lexical bundles vary in structure. Clause fragments and passive verbs are the main features of this genre. In terms of discourse function, research-oriented bundles are the most common, while participant-oriented bundles are the least common. Each discourse function has its own structural characteristics. The study also finds that a single lexical bundle can belong to two functional categories. These findings contribute to a better understanding of the characteristics of written academic discourse. From a pedagogical point of view, the findings can be used as learning material for both native and non-native speakers.

Indonesian Journal of Applied Linguistics

Dictionary 4.0: Alternative Presentations for Indonesian Multilingual Dictionaries

Author: Nasution Arbi Haza
Suhardijanto Totok

Building a multilingual dictionary for 719 languages in Indonesia is a challenging task. We have developed an application to create the Leipzig-Jakarta list database for all indigenous languages in Indonesia. The database can be used to generate a lexical similarity or lexical distance matrix between languages by comparing their word lists. As a start, we covered 11 languages: Indonesian, Javanese, Sundanese, Madurese, Bima, Ternate, Tidore, Palembang Malay, Mandailing Batak, Malay, and Minangkabau. The application has two main features: exploring existing translations, and adding translations for a new language or editing existing translations through crowdsourcing. A user acceptance test yielded a score of 3.48 out of 4.

Repository Universitas Islam Riau

SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

Author: Aiton Grant
Ambridge Ben
Ataman Duygu
Ate Yustinus Ghanggo
Barta Botond
Bayyr-ool Aziyana
Bernardy Jean-Philippe
Chodroff Eleanor
Coler Matt
Cotterell Ryan
Ek Adam
El-Khaissi Charbel
Ganieva Sofya
Gasser Michael
Goldman Omer
Habash Nizar
Hatcher Richard J.
Hulden Mans
Ivanova Sardana
Khalifa Salam
Kieraś Witold
Klyachko Elena
Krizhanovsky Andrew
Krizhanovsky Natalia
Kumar Ritesh
Lakatos Dorina
Lane William
Leonard Brian
Liu Zoey
Mielke Sabrina J.
Montoya Samame Jaime Rafael
Nicolai Garett
Nuriah Zahroh
Oncevay Arturo
Pimentel Tiago
Plugaryov Matvey
Ponti Edoardo M.
Prud'hommeaux Emily
Raj Mohit
Ratan Shyam
Ryskina Maria
Salchak Aelita
Salehi Ali
Shcherbakov Andrey
Sheifer Karina
Silva Villegas Gema Celeste
Stoehr Niklas
Straughn Christopher
Suhardijanto Totok
Szolnok Gábor
Tyers Francis M.
Vania Clara
Vylomova Ekaterina
Washington Jonathan
Woliński Marcin
Wu Shijie
Yarowsky David
Ács Judit
Publication venue: The Association for Computational Linguistics
Publication date: 01/08/2021

This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.

Edinburgh Research Explorer

Helsingin yliopiston digitaalinen arkisto

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen